2025-05-26-20-00
Streaming Attention Approximation via Discrepancy Theory
Abstract
arXiv:2502.07861v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. In this paper we study the streaming complexity of attention approximation, a key computational primitive underlying token generation. Our main contribution is BalanceKV, a streaming algorithm for -approximating attention computations based on geometric process for selecting a balanced collection of Key and Value tokens as per Banaszczyk's vector balancing theory. We complement our algorithm with space lower bounds for streaming attention computation. Besides strong theoretical guarantees, BalanceKV exhibits empirically validated performance improvements over existing methods, both for attention approximation and end-to-end performance on various long context benchmarks.
摘要
大型语言模型(LLMs)已取得显著成功,但其高内存需求对长上下文标记生成提出了挑战。本文研究了注意力近似计算的流式复杂度,这是标记生成背后的关键计算原语。我们的主要贡献是BalanceKV算法——一种基于几何过程的流式算法,用于实现近似的注意力计算。该算法依据Banaszczyk向量平衡理论,通过选择平衡的键值标记集合来实现优化。我们通过流式注意力计算的空间下界分析对算法进行了理论补充。除强理论保证外,BalanceKV在注意力近似计算及多种长上下文基准测试的端到端性能方面,均展现出经实证验证的优越性,显著超越现有方法。